TMop: a Tool for Unsupervised Translation Memory Cleaning
Authors
Abstract
We present TMop, the first open-source tool for automatic Translation Memory (TM) cleaning. The tool implements a fully unsupervised approach to the task, which makes it possible to spot unreliable translation units (sentence pairs in different languages that are supposed to be translations of each other) without requiring labeled training data. TMop includes a highly configurable and extensible set of filters capturing different aspects of translation quality. It has been evaluated on a test set of 1,000 translation units (TUs) randomly extracted from the English-Italian version of MyMemory, a large-scale public TM. Results indicate its effectiveness in automatically removing “bad” TUs, with performance comparable to a state-of-the-art supervised method (76.3 vs. 77.7 balanced accuracy).
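The abstract reports results as balanced accuracy. As a reminder of what that metric measures, the sketch below computes it for a binary "good"/"bad" labeling as the mean of the recall on each class. The function name, the label strings, and the toy data are illustrative assumptions; they are not taken from TMop or from the paper's test set.

```python
def balanced_accuracy(gold, predicted, positive="bad"):
    """Return the mean of recall on the positive class and recall on the other class."""
    tp = fn = tn = fp = 0
    for g, p in zip(gold, predicted):
        if g == positive:
            if p == positive:
                tp += 1  # bad TU correctly flagged
            else:
                fn += 1  # bad TU missed
        else:
            if p == positive:
                fp += 1  # good TU wrongly flagged
            else:
                tn += 1  # good TU correctly kept
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in the gold labels")
    sensitivity = tp / (tp + fn)  # recall on the "bad" class
    specificity = tn / (tn + fp)  # recall on the "good" class
    return (sensitivity + specificity) / 2


# Hypothetical toy labels, made up for illustration only.
gold = ["bad", "bad", "good", "good", "good", "good"]
pred = ["bad", "good", "good", "good", "good", "bad"]
score = balanced_accuracy(gold, pred)
print(f"balanced accuracy: {score:.3f}")
```

Because each class contributes equally regardless of its size, the metric is not inflated when "good" TUs greatly outnumber "bad" ones, which is the usual reason for preferring it over plain accuracy on skewed data.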
Similar resources
An Unsupervised Method for Automatic Translation Memory Cleaning
We address the problem of automatically cleaning a large-scale Translation Memory (TM) in a fully unsupervised fashion, i.e. without human-labelled data. We approach the task by: i) designing a set of features that capture the similarity between two text segments in different languages, ii) using them to induce reliable training labels for a subset of the translation units (TUs) contained in the ...
Bilingual Data Cleaning for SMT using Graph-based Random Walk
The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. Previous work often used supervised learning methods to filter low-quality data, but these require a fair number of human-labeled examples, which are not easy to obtain. To reduce the re...
UEdin participation in the 1st Translation Memory Cleaning Shared Task
We present our submission for the 1st Translation Memory Cleaning Shared Task. We treat the task as a 3-class classification problem and extract features that indicate (i) source sentence complexity, (ii) misalignments between source and target, and (iii) target sentence complexity. Our results show that focusing on the target side and finding ways to estimate the alignment quality between sour...
Automatic TM Cleaning through MT and POS Tagging: Autodesk's Submission to the NLP4TM 2016 Shared Task
We describe a machine learning based method to identify incorrect entries in translation memories. It extends previous work by Barbu (2015) through incorporating recall-based machine translation and part-of-speech-tagging features. Our system ranked first in the Binary Classification (II) task for two out of three language pairs: English–Italian and English–Spanish.
A comparative evaluation of outlier detection algorithms: Experiments and analyses
We survey unsupervised machine learning algorithms in the context of outlier detection. This task challenges state-of-the-art methods from a variety of research fields to applications including fraud detection, intrusion detection, medical diagnoses and data cleaning. The selected methods are benchmarked on publicly available datasets and novel industrial datasets. Each method is then submitted...
Publication date: 2016